Multivariate Regression
Previously you learned how to make predictions of
the values of a dependent variable based on the values of an independent
variable. Remember when we predicted the Leave vote share across
constituencies based on the mean age? Or the Leave vote
share based on the percentage of foreign-born residents? In both cases,
we had one variable to predict the Leave vote
share. We call these variables predictors,
regressors or independent variables.
What was common in both of our models was that the variable we were
trying to predict (the regressand or dependent
variable) was the Leave vote share. Both age and
share of foreign-born residents seemed to be strongly related to the
dependent variable.
But it might well be that there is an omitted factor
that relates to both Leave vote share and the mean age (or
the share of foreign-born residents)! Think of ice cream sales and deaths by
drowning. You are likely to find a strong correlation between
the two, but are ice cream sales really causing deaths by drowning? Of
course not: both are correlated with summer temperatures. Or you might be
interested in proving that going to the doctor makes you healthier one
year later. After collecting your data, you find no effect. But this
might be because people who go to the doctor may be less healthy in the
first place, so we’re not comparing like with like. When you take into
account prior health conditions, you are more likely to find a positive
effect of visiting the doctor on future health outcomes.
In regression analysis, we control for factors that
are related to both the dependent and independent variable. We
say that you control or adjust for a variable when you add to your model
one or more variables, so that these can be kept constant when
you interpret the relationship between your dependent and independent
variables of interest. In the above example, we need to control for
prior health conditions, as they are related to both the likelihood of going
to the doctor and future health outcomes. Likewise, when predicting
deaths by drowning, you add summer temperature to your model, as it is correlated
with both deaths by drowning and ice-cream sales. In both cases you
identify a confounder, something that is causally
related to both the dependent and independent variables.
• A confounder is a pre-treatment variable that
influences both your dependent variable \(Y\) and your main independent variable of
interest \(X\).
• An omitted variable is any variable that is
omitted from the regression, pre- or postdetermined.
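To see why confounders matter, here is a minimal simulated sketch (the variables are made up and not part of the Brexit data used below): a confounder z drives both x and y, while x has no true effect on y.

```r
# Illustrative simulation of confounding (all variables are invented):
# z is the confounder (think summer temperature), x is "ice cream sales",
# y is "deaths by drowning". x has no true effect on y.
set.seed(42)
z <- rnorm(1000)
x <- 2 * z + rnorm(1000)
y <- 3 * z + rnorm(1000)

coef(lm(y ~ x))      # slope on x is far from 0: a spurious association
coef(lm(y ~ x + z))  # controlling for z, the slope on x is close to 0
```

The bivariate slope picks up the common dependence on z; once z is held constant, the association between x and y all but vanishes.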
Remember that in bivariate linear regression analysis, we fit a
single line which minimises the sum of squared residuals through a
scatter plot. Similarly, in multivariate linear regression analysis, we
fit a (hyper)plane through a multi-dimensional cloud of data
points - and this (hyper)plane is, by a similar logic, the one
that minimises the sum of squared residuals. The simplest form of
multivariate regression has one dependent and two independent variables.
You can increase the number of your independent variables as long as:
their inclusion is motivated by theory; they are
predetermined (i.e. they are observable before the treatment
occurs/the outcome is observed); they can be expected to correlate
with both the dependent and the independent variable of interest;
and there are sufficient degrees of freedom (simply speaking, the
number of observations is higher than the number of independent
variables in the model).
Last week, our models where age predicted the Leave vote
had some unexplained variation in the dependent variable – that is,
there was variation in our dependent variable that was not explained by
our independent variable. By adding one or more independent variables,
also known as covariates, you reduce the sum of the squared
residuals and thus your R-squared increases. However, the goal of
multivariate regression is to address confounding
factors, so that we can better estimate the effect our
independent variable of interest has on the dependent variable. Thus,
an increase in the R squared does not in itself necessarily mean a
better regression model, as it is no indication that we are
actually explaining the effect of a given predictor \(X\) on our outcome \(Y\). In the social sciences, we
are typically not so much interested in predicting or fully determining
\(Y\), but rather in estimating
possible connections between \(X\) and
\(Y\). This is why assessing the
relationship between \(X\) and \(Y\), and the confidence we have in its
causality is our chief aim - we are not trying to
increase the R squared! The equation for multivariate regression is
similar to that for bivariate regression, with the difference that you
add \(p\) independent variables, each with its own
\(\beta\) coefficient:
\[
Y_i=\alpha+\beta_1 X_{i 1}+\beta_2 X_{i 2}+\ldots+\beta_p X_{i
p}+\epsilon_i
\]
where \(Y_{i}\) is the value of the
dependent variable for observation \(i\); \(X_{i1}\) through \(X_{ip}\) are the values for
observation \(i\) of \(p\) distinct
independent variables; \(\alpha\) is
the predicted value of \(Y\) ( \(\hat{Y}\) ) when all of the independent
variables (\(X_1\) through \(X_p\)) are equal to zero; \(\beta_1\) through \(\beta_p\) are the estimated regression
coefficients; \(\epsilon_i\) is the
idiosyncratic error term. Each regression coefficient represents the
change in \(\hat{Y}\) relative to a one
unit change in the respective independent variable holding other
variables constant. In the multiple regression situation, \(\beta_1\), for example, is the change in
\(Y\) relative to a one unit change in
\(X_1\), holding all other independent
variables constant (i.e., when the remaining independent variables are
held at the same value or are fixed).
In R, the function for multivariate linear regression is
lm(), with each predictor being added to the right of the
formula with a plus sign:
lm(data = ..., dependent_variable ~ independent_variable_1
+ independent_variable_2 + independent_variable_3)
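As a quick illustration of this syntax with R's built-in mtcars data (not the Brexit data used below), we can regress fuel economy (mpg) on weight (wt) and horsepower (hp):

```r
# Multivariate regression with two predictors, using built-in data:
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)  # one intercept plus one slope per predictor
```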
Working with Observational Data
Let’s return to our data set from before. Remember how we found that
local authorities in London “stood out” as outliers due to their
especially low share of Leave vote? Now we are going to
explore further regional variation, and see how it combines with other
variables to explain the distribution of the Leave vote. We
start by loading in the data (saving the dataframe for now as
brexit_complete), and installing the packages
scatterplot3d and stargazer which we are going
to need later on:
setwd("/Users/your_username/Documents/DataAnalysisR/script4")
brexit_complete <- read.csv("brexit_data.csv")
install.packages("scatterplot3d")
install.packages("stargazer")
library("scatterplot3d")
library("stargazer")
Here is a description of the variables we are going to work with
today:
Area | Name of the local authority
Region | Name of the region that contains the local authority
Percent_Leave | Percentage of Leave votes
age_mean | The mean age of residents in the area
Bachelors_deg_percent | The percentage of people with bachelor’s degrees in an area
Health_very_good_percent | The percentage of people reporting very good health in the area
Remember how the NAs (missing data) kept getting in the
way? Today, for simplicity, we just get rid of them, creating a new
dataframe called simply brexit only with the observations
with all complete entries:
brexit <- na.omit(brexit_complete)
When we remove observations, we always want to pay attention to
which observations we’re removing. Are all local authorities
equally likely to have missing data? Let’s look at the
Region variable in the original dataset and in the new
dataset:
table(brexit_complete$Region)
table(brexit$Region)
We can use the merge() function to see the count for
each region from the two tables side by side. First, we want to turn the
tables into dataframes with the as.data.frame() command.
This will return a dataframe with the names of the regions as the
variable Var1, and their frequencies as a variable
Freq. Then we merge the two dataframes into a single one,
passing the argument by= to indicate the column by which we
want to merge the two objects (in this case Var1), as well
as the argument all = TRUE to make sure we are keeping all
possible values of Var1 in the merged dataframe.
count_old <- as.data.frame(table(brexit_complete$Region))
count_new <- as.data.frame(table(brexit$Region))
merge(count_old, count_new, by = "Var1", all = TRUE)
So, we have removed all local authorities in Scotland and
Northern Ireland, plus a few observations here and there in other
regions. Why can this be a problem? Selection bias. Our
sample is not representative of the population, because we are excluding
observations non-randomly.
How do we get around this? If this were the dataset you were using for
your thesis, you would want to collect more data to paint an accurate
picture of Leave votes at the national level; as this is
just an exercise, we can proceed, as long as we understand that
all our conclusions concern an incomplete subsample of English and Welsh
local authorities only. This may not be great, but more importantly it is
absolutely imperative to be clear about what our results are telling us - and
where their limitations are!
First of all, let’s get rid of our original dataframe
brexit_complete, so we don’t get confused. We will only be
using the brexit dataframe subset from now on.
rm(brexit_complete)
Categorical Variables as Predictors
What would be your best prediction of Leave vote for a
local authority knowing only the region in which it is located?
tapply(brexit$Percent_Leave, brexit$Region, mean)
As we noticed before, local authorities in London ‘stand out’ as
particularly low on their share of the Leave vote. How
would we quantify this effect in a regression setting? We can try and
reproduce what we have learnt with continuous variables by creating a
‘London dummy’ variable that takes the value of 1 if the local
authority is located in London, and 0 otherwise. The effect of a
one-unit change (from 0 to 1) then is tantamount to comparing non-London
with London observations:
brexit$ldn_dummy <- NA
brexit$ldn_dummy <- ifelse(brexit$Region == "London", 1 , 0 )
lm_ldn <- lm(Percent_Leave ~ ldn_dummy, data = brexit)
summary(lm_ldn)
How do we interpret this substantively? A local authority in London
can be predicted to have a Leave vote of 56.1403 (the intercept) +
(−16.3347 × 1) (the slope of the ldn_dummy variable) =
39.8056. We add (−16.3347 × 1) to the intercept because all London local
authorities take the value of 1 for our ldn_dummy variable.
A local authority outside of London takes the value of 0 for
ldn_dummy, so the predicted Leave vote in this
model is 56.1403 + (−16.3347 × 0) = 56.1403 - that is, exactly the
intercept.
What if we want to predict the Leave Vote in non-London
local authorities more accurately, given that we have information on the
different regions they are located in? Let’s now create one more dummy
for the West Midlands, the most ‘Brexit-y’ region in our sample, and
add this variable to our lm_ldn model, to estimate
our first regression with two independent variables:
brexit$wm_dummy <- NA
brexit$wm_dummy <- ifelse(brexit$Region == "West Midlands", 1 , 0 )
lm_ldn_wm <- lm(Percent_Leave ~ ldn_dummy + wm_dummy, data = brexit)
summary(lm_ldn_wm)
What would our best guess of Percent_Leave vote be for a
local authority, knowing only that it’s in the West Midlands? It would
be the intercept plus the slope for wm_dummy because these
are observations that take the value of 1 for wm_dummy and
0 for ldn_dummy, hence 55.7091 + (−15.9035 × 0) + (4.5893 ×
1) = 60.2984. And if it’s in London? It would be the intercept plus the
slope for ldn_dummy because these are observations that
take the value of 0 for wm_dummy and 1 for
ldn_dummy, hence 55.7091 + (−15.9035 × 1) + (4.5893 × 0) =
39.8056. And if it’s neither in London nor in the West Midlands?
To regress on a categorical variable, we could continue adding
dummies for each possible value (“East Midlands”, “North East”, “Wales”
etc.) our predictor can take, or simply let R do it all at once:
lm_region <- lm(Percent_Leave ~ Region, data = brexit)
summary(lm_region)
At this stage, this is no different from computing
subgroup means. As a local authority can take a value of 1 for only
one of the categories, and by definition takes 0 for all
others (e.g. it can’t be in both London and
the West Midlands), the predicted value of Percent_Leave will
be the intercept plus the coefficient for the Region the
local authority is in.
Note also that in the regression output R is apparently ‘dropping’
one Region: East. This is because local authorities in the Region ‘East’
will take a value of 0 for all other dummies, so our best guess for the
outcome in the East is exactly equal to the intercept. When we
include categorical variables in a regression, this will always happen:
one category will serve as reference category and the
coefficients for the other categories will indicate the expected
difference between that category and the reference category. Therefore,
in the regression above, we can interpret the value of the intercept
56.941 as the predicted value of Percent_Leave for local
authorities in the East, and the coefficient −17.135 as the expected
difference in Percent_Leave between a local authority in
London and a local authority in the East.
By default, the reference category of a character variable in R is
the first one alphabetically. This might not always be the one that
provides the most intuitive or helpful results. You can change the
reference category by recoding the variable as a factor variable with
as.factor(). In a factor variable, each possible value our
variable takes is stored as an integer level (1, 2, 3 etc.) as well as a
character label. You can then change the reference category with the
relevel() function: R will use the category
with level 1 as the reference.
brexit$Region <- as.factor(brexit$Region)
levels(brexit$Region)
brexit$Region <- relevel(brexit$Region, ref = "Wales")
levels(brexit$Region)
lm_region <- lm(Percent_Leave ~ Region, data = brexit)
summary(lm_region)
Does changing our reference category change anything about
our predictions? Why or why not?
Modelling and Visualising the Effects of a Dummy and a Continuous
Variable
In the previous script, you found a positive effect of mean age of a
local authority on the Leave vote share in that area by
running a bivariate regression:
lm_age <- lm(data = brexit, Percent_Leave ~ age_mean)
summary(lm_age)
How good is our fit? Is there any variable that can help us reduce
the sum of squared residuals further? Let’s see if our London dummy can help.
To look into this, we can now turn to the residuals of the
lm_age model (i.e. the distance between predicted and
actual values) and compare them for London and non-London local
authorities. We can do this graphically with a boxplot, or by regressing
the residuals on our ldn_dummy variable.
residuals <- residuals(lm_age)
boxplot(residuals ~ brexit$ldn_dummy)
lm(residuals ~ brexit$ldn_dummy)
The boxplot and the regression show that the residuals have
substantially lower values for local authorities in London than
elsewhere. This means that “London” local authorities tend to be below
the regression line for the bivariate lm_age model: that
is, they have lower values of Percent_Leave than predicted
on the basis of their mean age by our bivariate model. Thus, it is
likely that by using only age as a predictor, we are not accounting for
variation in Percent_Leave due to regional
differences (“London-ness”, if you will) in our sample. To account for
the effects of both age and London
simultaneously, we use a multivariate model:
lm_ldn_age <- lm(Percent_Leave ~ ldn_dummy + age_mean, data = brexit)
summary(lm_ldn_age)
We interpret these coefficients as follows: a 1-year increase in the mean
age of a local authority leads to a 0.7063 percentage-point increase in predicted
Leave vote share, independent of the effect of whether
the local authority is in London or outside London. A local
authority being in London reduces the predicted Leave vote
share by 12.7740 percentage points, independent of the effect of
local authority mean age. So, our best guess for a local authority
in London with a mean age of 50 would be 27.4593 + (−12.7740 × 1) +
(0.7063 × 50). What’s our best guess for a local authority outside
London with a mean age of 50?
Now, compare the two bivariate models lm_ldn and
lm_age with our multivariate model
lm_ldn_age.
stargazer(lm_ldn, lm_age, lm_ldn_age, type = "html",
out = "ldn_age_models.html")
What happened to our coefficients? What happened to our \(R^2\)?
When we work with multivariate regressions, the more predictors we
add to our model, the bigger our R squared gets (it can never decrease). This is by design.
However, this may not tell us much about the relative performance of
various models: you could just throw in hundreds of variables and get
marginally higher and higher values of \(R^2\). That’s why it’s often more useful to
look at the adjusted R squared goodness-of-fit
coefficient, which adds a penalty for each additional predictor when
calculating the model fit:
\[
A d j . R^2=1-\left(1-R^2\right) \frac{n-1}{n-k-1}
\]
where \(k\) refers to the number of
predictors in the regression model.
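To check your understanding of the formula, you can compute the adjusted R-squared by hand and compare it with what summary() reports. A sketch using R's built-in mtcars data (not the Brexit data):

```r
# Adjusted R-squared by hand, checked against summary()'s value:
fit <- lm(mpg ~ wt + hp, data = mtcars)
r2  <- summary(fit)$r.squared
n   <- nrow(mtcars)  # number of observations
k   <- 2             # number of predictors in the model
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_r2                      # computed by hand
summary(fit)$adj.r.squared  # reported by summary(); identical
```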
Now let’s try and visualise what we have found in the
lm_ldn_age model. What are the slope and the intercept for
the function that predicts values of Percent_Leave for
observations outside London? What are the slope and the intercept for
observations in London?
plot(brexit$age_mean, brexit$Percent_Leave)
abline(a = 27.45934, b = 0.7062648)
abline(a = 27.45934 + (-12.77396),
b= 0.7062648, col = "red")
# note: abline takes two parameters, a = intercept and b = slope;
# because the ldn_dummy variable only takes the values of 0 and 1,
# we can think of the regression line for London observations as a
# line whose intercept is the model intercept plus the coefficient for
# ldn_dummy, and whose slope is the coefficient for age_mean.
# Or, more elegantly:
plot(brexit$age_mean, brexit$Percent_Leave)
abline(a = lm_ldn_age$coefficients[1], b = lm_ldn_age$coefficients[3])
abline(a = lm_ldn_age$coefficients[1]+lm_ldn_age$coefficients[2],
b= lm_ldn_age$coefficients[3], col = "red")
points(brexit$age_mean[brexit$ldn_dummy == 1],
brexit$Percent_Leave[brexit$ldn_dummy == 1], col = "red")
Technically, what the multivariate regression is doing is not fitting
“two lines” but rather it is fitting a plane, defined by three
dimensions: the London dummy, mean age and
Percent Leave. The logic is exactly the same as the
regression lines we plotted in the previous script: it’s the plane that
minimises the sum of squared residuals (the math is a bit trickier
though).
scatterplot3d(brexit$ldn_dummy, brexit$age_mean, brexit$Percent_Leave,
color = brexit$ldn_dummy+1)$plane3d(lm_ldn_age)
The 3-D plot shows Percent Leave on the Z-axis (vertical
height), mean age on the Y-axis (depth in perspective), and the
London dummy on the X-axis (horizontal length). Note that
the latter only takes values of 0 and 1, so the “cloud” of observations
is effectively flattened onto either the right (London local authorities)
or the left (non-London local authorities) face of the cube. The two
edges of the plane on the left and right faces
are inclined according to the slope for age. The intercept
of the plane with the Percent_Leave vertical axis on the
left, where the London dummy is 0, represents the intercept of our
model. The intercept of the plane with the Percent_Leave
vertical axis on the right, where the London dummy is 1, represents the
intercept of our model plus the (negative) coefficient for
ldn_dummy.
Modelling the Effects of Two Continuous Variables and Regression
Anatomy
Now that we have visualised the basic principle of a multivariate
regression, let’s generalise this to the case of two continuous
independent variables, and look further into the computations going on
“behind the scenes” when we call the lm() function with two
regressors.
What other things may be related to Leave Vote?
Education is something that shows up consistently in the analysis of
Brexit voting patterns, and is a great candidate for a potential ‘omitted
variable’ confounding our regression models. That is, we may have reason
to believe that some of the variation in the Leave Vote
share that is being explained e.g. with variation in mean age (or
regional variation) actually is to be attributed to variation in the
omitted variable, education.
We can test for this by checking how the residuals of our original
lm_age model correlate with education levels. To
operationalise education, we are going to use the
Bachelors_deg_percent (share of residents with at least an
undergraduate degree). We proceed in two steps:
First, re-run the linear regression without the
potential omitted variable. Here, regress the Leave vote
share on mean age only.
Then, inspect the relationship between the residuals from that
regression and the potentially omitted variable, by either:
• (i) plotting them against the potential omitted variable (here,
education),
• (ii) calculating the correlation between the residuals and the
potential omitted variable, or
• (iii) running a regression of the residuals on the potential
omitted variable.
A positive/negative slope, a strong correlation, or a large
coefficient estimate indicates that the potential omitted variable
(education) and the error term from the bivariate regression are
correlated, i.e. the model likely suffers from omitted variable
bias.
Reveal Answer
Let’s go back to our original age-only bivariate model, and extract
the residuals with the residuals() function.
lm_age <- lm(data = brexit, Percent_Leave ~ age_mean)
residuals <- residuals(lm_age)
Now let’s look at the relationship between the residuals of our
age-only model and the education variable:
plot(residuals ~ brexit$Bachelors_deg_percent)
cor(residuals, brexit$Bachelors_deg_percent)
lm(data = brexit, residuals ~ Bachelors_deg_percent)
Levels of education are strongly and negatively related to the
residuals of our bivariate regression, so we would want to add education
alongside age in our model:
lm_age_edu <- lm(data = brexit, Percent_Leave ~ age_mean +
Bachelors_deg_percent)
summary(lm_age_edu)
In a multivariate regression, we interpret each \(\beta_i\) as the change in \(\hat{Y}\) relative to a one unit-change in
\(X_i\), holding all other
independent variables constant (i.e., when the remaining
independent variables are held at the same value or are fixed). How do
you substantively interpret each coefficient in this case?
Let’s compare this to the age-only and age-and-education models side
by side using stargazer, with some added parameters to make
it look nicer:
stargazer(lm_age, lm_age_edu, type = "html",
out = "age_edu_models.html",
covariate.labels = c("Mean Age",
"Share Degree-Holders",
"Intercept"),
digits = 3, dep.var.labels = "Leave Vote Share")
Can you add the bivariate model to the table? What has
happened to our coefficient for mean age? What has happened
to our \(R^2\)?
And this is what the regression plane looks like in three
dimensions.
scatterplot3d(brexit$age_mean,
brexit$Bachelors_deg_percent,
brexit$Percent_Leave,
angle = 250)$plane3d(lm_age_edu)
Now, let’s try and understand where our coefficients for age and
education are coming from. So far we have interpreted them as the effect
of an independent variable on the dependent variable net of the
effects of the other independent variable. More formally, we can
show that the multivariate regression coefficient of any individual
variable can be obtained by first netting out the effect of other
variable(s) in the regression model from both the dependent variable and
the independent variable. (This is known as the Frisch-Waugh
theorem.)
Let’s prove this with a step-by-step solution for the
age_mean coefficient of our multivariate model
lm_age_edu.
• (i) First, net out the effect of Bachelors_deg_percent
on age_mean by deriving the residuals of a bivariate
regression where age_mean is the dependent variable and
Bachelors_deg_percent is the independent variable:
partial1 <- lm(age_mean ~ Bachelors_deg_percent, data = brexit)
partial1
residuals_partial1 <- residuals(partial1)
• (ii) Secondly, net out the effect of
Bachelors_deg_percent on Percent_Leave by
deriving the residuals of a bivariate regression where
Percent_Leave is the dependent variable and
Bachelors_deg_percent is the independent variable:
partial2 <- lm(Percent_Leave ~ Bachelors_deg_percent, data = brexit)
partial2
residuals_partial2 <- residuals(partial2)
• (iii) Finally, regress the residuals of Percent_Leave
after netting out the effect of Bachelors_deg_percent (step ii)
on the residuals of age_mean after netting out the
effect of Bachelors_deg_percent (step i):
full_model <- lm(residuals_partial2 ~ residuals_partial1, data = brexit)
summary(full_model)
The coefficient is, indeed, the same as the age_mean
coefficient in our multivariate model fit lm_age_edu. What
we’ve learnt is that, in short, we can break down any multivariate
regression into an iteration of bivariate regressions.
How would you reproduce the coefficient for
Bachelors_deg_percent from the same model using this
method?
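The same three-step logic can be verified on any dataset. A self-contained sketch on simulated data (all variables invented):

```r
# Frisch-Waugh check: the coefficient on x1 from the full model equals
# the coefficient from the residual-on-residual regression.
set.seed(7)
x1 <- rnorm(200)
x2 <- 0.5 * x1 + rnorm(200)  # x2 correlated with x1
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(200)

b_full    <- coef(lm(y ~ x1 + x2))[["x1"]]
res_x1    <- residuals(lm(x1 ~ x2))  # x1 net of x2
res_y     <- residuals(lm(y ~ x2))   # y net of x2
b_partial <- coef(lm(res_y ~ res_x1))[["res_x1"]]

all.equal(b_full, b_partial)  # TRUE
```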
A Final Exercise with our Brexit Dataset
First, create a dummy variable that takes the value of 1 if the local
authority has an above-average share of residents who report being in
very good health, and 0 otherwise.
Reveal Answer
brexit$hea_dummy <- NA
brexit$hea_dummy[brexit$Health_very_good_percent >
mean(brexit$Health_very_good_percent)] <- 1
brexit$hea_dummy[brexit$Health_very_good_percent <=
mean(brexit$Health_very_good_percent)] <- 0
Now regress Leave vote on this new dummy and the
London dummy we created earlier:
Reveal Answer
lm_ldn_hea <- lm(data = brexit, Percent_Leave ~ ldn_dummy +
hea_dummy)
summary(lm_ldn_hea)
• What Leave vote share would the model predict for a
local authority that has below-average residents in very good health and
is located outside London?
Reveal Answer
60.7085 #the value of the intercept
• What Leave vote share would the model predict for a
local authority that has above-average residents in very good health and
is located outside London?
Reveal Answer
60.7085 - 10.3918
#the intercept plus the "health dummy" coefficient
• What Leave vote share would the model predict for a
local authority that has below-average residents in very good health and
is located in London?
Reveal Answer
60.7085 - 12.8577
#the intercept plus the "London dummy" coefficient
• What Leave vote share would the model predict for a
local authority that has above-average residents in very good health and
is located in London?
Reveal Answer
60.7085 - 10.3918 - 12.8577
# the intercept plus the "London dummy" coefficient
# AND the "health dummy" coefficient.
Now compute the sample means of Percent_Leave for all
these subgroups.
Reveal Answer
mean(brexit$Percent_Leave[brexit$ldn_dummy == 0 & brexit$hea_dummy == 0])
mean(brexit$Percent_Leave[brexit$ldn_dummy == 1 & brexit$hea_dummy == 0])
mean(brexit$Percent_Leave[brexit$ldn_dummy == 0 & brexit$hea_dummy == 1])
mean(brexit$Percent_Leave[brexit$ldn_dummy == 1 & brexit$hea_dummy == 1])
#or, more elegantly:
tapply(brexit$Percent_Leave, list(London = brexit$ldn_dummy,
Health = brexit$hea_dummy), mean)
Why are our predicted values not the same as the
observed mean values? What’s the difference between this regression with
two dummies and our regression model with the London and West Midlands
dummies?
Plots in ggplot
As our plots become more complex and interesting, it might be worth
learning how to reproduce them in ggplot2, which normally
takes fewer lines of code and employs a more intuitive language. More
resources on this package are in the additional material for script 2.
It’s entirely up to you whether you want to look into this - for
this class, plots in base R are perfectly fine.
To install and load the package:
install.packages("tidyverse")
library(tidyverse)
Horizontal boxplot of Leave Vote by region:
ggplot(data = brexit,
mapping = aes(x = Region, y = Percent_Leave, fill = Region)) +
geom_boxplot() + theme_minimal() + ylab("Percent Leave") +
xlab("Region") + coord_flip() + theme(legend.position = "none") +
ggtitle("Leave Vote by Region") +
theme(plot.title = element_text(hjust = 0.5))
For the 2-dimensional age and London scatterplot:
ggplot(data = brexit, mapping = aes(x = age_mean, y = Percent_Leave,
col = as.factor(ldn_dummy),
group = as.factor(ldn_dummy))) +
geom_point() +
scale_color_manual(values = c("blue", "orange"),
labels = c("Outside London", "In London"),
name = "Local Authority") +
geom_abline(intercept = 27.45934, slope = 0.7062648, col = "blue") +
geom_abline(intercept = 27.45934 + (-12.77396),
slope = 0.7062648, col = "orange") +
ylab("Percent Leave") + xlab("Mean Age") +
ggtitle("Leave Vote by Age and Location of Local Authority") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
Some Notes on Randomisation and Confounding
Confounding variables – variables that are related
to both the independent variable and the dependent variable – are
perhaps the main challenge in empirical social science research.
Crucially, the direction of the bias that confounders introduce
is ambiguous ex ante: they can either inflate or deflate estimated
coefficients. Since a confounder may even be
unobservable (i.e. be a latent variable), it may be
difficult to control for it even if we know the true underlying data
generating process. If a potential confounder is not the only confounder
in the true model, then controlling for that confounder can actually
worsen one’s ability to draw valid causal inferences: removing the bias in
one direction while leaving the bias in the other direction in place can
pull OLS estimates further from the true population
parameter.
For more on this, see Jason Seawright, “Regression-based
inference: a case study in failed causal assessment” (in Brady and
Collier, 2010, 2nd edition), as well as Kevin Clarke, “The phantom
menace: omitted variable bias in econometric research” (2005).
Random assignment to treatment solves this problem
since observed and unobserved confounders are balanced between control
and treatment groups in expectation. In the exercise that follows, for
instance, the treatment variable in our dataset is the timing at which the
conditional cash transfer scheme (Progresa) was introduced in a
precinct (early or late). Early Progresa precincts are our treatment
group. However, researchers must demonstrate that both treatment
and control groups are balanced with respect to salient pre-treatment
characteristics. This involves quantitative balance tests,
checking that the means of the pre-treatment characteristics are
equivalent for treatment and control groups, but also qualitative
observations about how treatment was assigned and why it was (as-if) at
random.
For more on this, see Dunning, “Natural experiments in the
social sciences” (2012).
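A quantitative balance check of this kind can be sketched as follows on simulated data (the variables here are invented; with the actual Progresa data you would compare pre-treatment covariates such as avgpoverty or t1994r across the two treatment groups):

```r
# Simulated randomised assignment: under randomisation, pre-treatment
# covariate means should be similar across treatment and control groups.
set.seed(1)
n <- 500
treatment <- rbinom(n, 1, 0.5)           # random assignment to treatment
poverty   <- rnorm(n, mean = 4, sd = 1)  # a pre-treatment covariate

tapply(poverty, treatment, mean)  # group means should be close
t.test(poverty ~ treatment)       # expect a small difference, large p-value
```

With the real dataset, the analogous call would be along the lines of t.test(avgpoverty ~ treatment, data = ...), repeated for each pre-treatment variable.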
Exercise: Do Targeted Government Programs Increase Pro-Incumbent
Voting?
The data for this week comes from: Ana de la O. (2013). ‘Do
Conditional Cash Transfers Affect Voting Behavior? Evidence from a
Randomized Experiment in Mexico.’ American Journal of Political
Science, 57:1, pp.1-14.
In this exercise, you will be revisiting some previous topics and
refining them using techniques learned in this script. Previously, you’ve
looked at treatment and control groups, descriptive statistics and
linear regression. Now, you’re going to combine these to consider how
well randomisation holds up in an actual experiment and estimate how a
supposedly randomly assigned treatment impacted political behavior.
The original study looks at whether conditional cash transfers to
poor citizens affect voting behavior, specifically whether they boost
turnout and support for the incumbent party. In the late 1990s, the
Mexican government began a program called Progresa that
transferred about 35 USD per month to eligible poor families. Crucially,
poor families were randomly assigned to receive these cash
transfers either in September 1998 (early) or in January 2000 (late).
National elections were held in July 2000. The randomisation was
conducted at the village level: a government agency first randomly
selected the villages to receive early or late Progresa, and then all
eligible families in a selected village received the disbursement
(early or late). In this dataset, we have information
at the precinct level (precincts may contain many villages), which is
the smallest unit for which electoral results are published. To avoid
problems of comparability, every precinct in our dataset contains only
one village, which was randomly assigned either early or late
Progresa.
The author, following the distributive politics literature, expected
this conditional cash transfer (CCT) program to mobilise voters, leading
to an increase in turnout and support for the incumbent party (PRI in
this case). The outcome variable is PRI vote share in the 2000 election
(pri2000v) and the treatment is early or late
Progresa (treatment == 1 if the village received
early Progresa).
You can download the data above. Set your working directory and read
in the .csv file. The names and descriptions of variables
in the data set are below. Each observation in the data represents a
precinct.
treatment: Whether an electoral precinct contains a village where households received early Progresa
pri2000v: Official PRI vote share in the 2000 election
pri1994: Total PRI votes in the 1994 presidential election
avgpoverty: Precinct average of the village poverty index
pri2000s: PRI votes in the 2000 election as a share of precinct population above 18
t2000r: Official turnout in the 2000 election
t1994r: Official turnout in the 1994 election
pobtot1994: Total population in the precinct
pan1994: Total PAN votes in the 1994 presidential election
prd1994: Total PRD votes in the 1994 presidential election
pri1994s: Total PRI votes in the 1994 election as a share of precinct population above 18
pan1994s: Total PAN votes in the 1994 election as a share of precinct population above 18
prd1994s: Total PRD votes in the 1994 election as a share of precinct population above 18
pri1994v: Official PRI vote share in the 1994 election
pan1994v: Official PAN vote share in the 1994 election
prd1994v: Official PRD vote share in the 1994 election
votos1994: Total votes cast in the 1994 presidential election
t1994: Turnout in the 1994 election as a share of precinct population above 18
t2000: Turnout in the 2000 election as a share of precinct population above 18
villages: Number of villages in the precinct
rm(list=ls())
setwd("~/Desktop/DataAnalysisR/script4")
cct <- read.csv("cct.csv") #load data and call it "cct", conditional cash transfer
summary(cct)
Task 1
The main outcome variable that you will be using here is
pri2000v: the incumbent PRI party’s official vote share in
the 2000 election measured at the precinct level. What variables might
be good predictors of PRI support in 2000?
• Run a regression of pri2000v on
avgpoverty. Interpret the coefficient.
• Next, run a regression of pri2000v on
pri1994v. Interpret the coefficient.
• How might the coefficients change when including both predictors in
a multivariate regression model? Why?
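If you want to check the shape of these commands, the sketch below runs the two bivariate regressions on a simulated data frame (same column names as cct, but invented values, so it runs without the real file):

```r
set.seed(1)
sim <- data.frame(
  avgpoverty = runif(100, 3, 5),
  pri1994v   = runif(100, 20, 80)
)
sim$pri2000v <- 30 + 2 * sim$avgpoverty + 0.3 * sim$pri1994v + rnorm(100, 0, 5)

# Bivariate regressions, one predictor at a time
m_pov  <- lm(pri2000v ~ avgpoverty, data = sim)
m_past <- lm(pri2000v ~ pri1994v,   data = sim)
summary(m_pov)   # slope: change in PRI 2000 vote share per unit of poverty
summary(m_past)  # slope: change per point of 1994 PRI vote share
```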
Task 2
If previous PRI support and average poverty are themselves
correlated, we may be estimating a biased relationship when using a
bivariate linear regression involving either.
• Are pri1994v and avgpoverty
correlated?
• Run a regression of pri2000v on
avgpoverty. Save the residuals and plot the residuals
(y-axis) against the fitted values (x-axis). Add a regression line to
the plot that fits the relationship between the residuals and the fitted
values.
• Next, add a regression line to the plot that represents the
relationship between the residuals and pri1994v. Is there a
relationship in the residuals?
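One possible way to set up the residual diagnostics, again using simulated stand-in data rather than the real cct file:

```r
set.seed(2)
sim <- data.frame(avgpoverty = runif(100, 3, 5),
                  pri1994v   = runif(100, 20, 80))
sim$pri2000v <- 30 + 2 * sim$avgpoverty + 0.3 * sim$pri1994v + rnorm(100, 0, 5)

m <- lm(pri2000v ~ avgpoverty, data = sim)
res <- residuals(m)

# Residuals vs fitted values: by construction of OLS this line is flat
plot(fitted(m), res, xlab = "Fitted values", ylab = "Residuals")
abline(lm(res ~ fitted(m)), col = "red")

# Residuals vs the omitted predictor: a non-zero slope signals
# left-over structure that pri1994v could explain
plot(sim$pri1994v, res, xlab = "pri1994v", ylab = "Residuals")
abline(lm(res ~ sim$pri1994v), col = "blue")
```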
Task 3
How would including both average poverty and previous PRI support
affect the OLS estimates?
• Run a regression using both avgpoverty and
pri1994v to predict the PRI vote share in 2000. Interpret
the coefficients in light of previous estimates of the separate effects
of avgpoverty and pri1994v.
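The multivariate model is a one-line extension of the bivariate case; sketched here on simulated data with the same column names as cct:

```r
set.seed(3)
sim <- data.frame(avgpoverty = runif(100, 3, 5),
                  pri1994v   = runif(100, 20, 80))
sim$pri2000v <- 30 + 2 * sim$avgpoverty + 0.3 * sim$pri1994v + rnorm(100, 0, 5)

# Multivariate model: each coefficient is now the association with
# pri2000v holding the other predictor constant
m_both <- lm(pri2000v ~ avgpoverty + pri1994v, data = sim)
summary(m_both)
```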
Task 4
You will now consider the main treatment in the study: whether or not
a precinct received Progresa early (treated) or late (control).
If treatment assignment is independent of pre-treatment characteristics,
then there should be no relationship between any of the pre-treatment
variables and treatment assignment. Assess whether this assumption holds
true.
• Identify the mean values of avgpoverty for the
treatment and control groups, respectively. Do they differ?
• Create a box plot of avgpoverty for the treatment group and one
for the control group. Do the distributions differ?
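The balance check can be done with tapply() and boxplot(); the sketch below uses a simulated data frame in place of cct:

```r
set.seed(4)
sim <- data.frame(treatment  = rbinom(100, 1, 0.5),
                  avgpoverty = rnorm(100, 4, 0.5))

# Group means of the pre-treatment covariate
tapply(sim$avgpoverty, sim$treatment, mean)

# Side-by-side box plots, one per group
boxplot(avgpoverty ~ treatment, data = sim,
        names = c("Control (late)", "Treatment (early)"),
        ylab = "Average poverty index")
```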
Task 5
What is the effect of early treatment on support for the PRI in
2000?
• Run a regression of pri2000v on
treatment.
• Given what you just learned about the relationship between
treatment and avgpoverty, how would you expect
avgpoverty to affect the regression model? If the treatment
and control groups are balanced with respect to avgpoverty,
what should the coefficient be for avgpoverty in a
regression that also includes treatment?
• Regress pri2000v on treatment and
avgpoverty. Interpret the coefficients.
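A sketch of the two regressions, on simulated data (invented values, same column names):

```r
set.seed(5)
sim <- data.frame(treatment  = rbinom(100, 1, 0.5),
                  avgpoverty = rnorm(100, 4, 0.5))
sim$pri2000v <- 60 + 3 * sim$treatment + rnorm(100, 0, 8)

# Difference in means via regression: the treatment coefficient
# is the average outcome gap between early and late precincts
m_t <- lm(pri2000v ~ treatment, data = sim)

# If avgpoverty is balanced across groups, adding it should leave
# the treatment coefficient essentially unchanged
m_tc <- lm(pri2000v ~ treatment + avgpoverty, data = sim)
summary(m_t)
summary(m_tc)
```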
Task 6
If treatment is randomly assigned, it is independent of observed and
unobserved characteristics in expectation. Just as you did with
avgpoverty, evaluate this claim with respect to previous
support for the incumbent party (pri1994v). Use
tapply() and boxplot(). Are treatment and
control groups balanced?
• How would including the previous PRI variable affect the OLS
estimate for the treatment? Do you expect the coefficient on treatment
to change?
• Run a regression of pri2000v on treatment
and pri1994v. Interpret the coefficients. Does early
Progresa lead to greater support for the incumbent party?
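The same balance-then-regress pattern, sketched on simulated data:

```r
set.seed(6)
sim <- data.frame(treatment = rbinom(100, 1, 0.5),
                  pri1994v  = runif(100, 20, 80))
sim$pri2000v <- 40 + 3 * sim$treatment + 0.2 * sim$pri1994v + rnorm(100, 0, 8)

# Balance check on prior incumbent support
tapply(sim$pri1994v, sim$treatment, mean)
boxplot(pri1994v ~ treatment, data = sim)

# Treatment effect with the lagged outcome as a control
m <- lm(pri2000v ~ treatment + pri1994v, data = sim)
summary(m)
```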
Task 7
Finally, evaluate a model with key control variables in the dataset.
Namely, regress pri2000v on treatment,
pri1994v, pan1994v, prd1994v,
avgpoverty, t1994r and
pobtot1994. How does the coefficient on the treatment
variable change when control variables are added? What effect did the
policy intervention have on support for the incumbent party?
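A sketch of the full specification on simulated data, using pobtot1994 as the precinct population variable from the codebook (values again invented so the code runs standalone):

```r
set.seed(7)
n <- 100
sim <- data.frame(treatment  = rbinom(n, 1, 0.5),
                  pri1994v   = runif(n, 20, 80),
                  pan1994v   = runif(n, 10, 50),
                  prd1994v   = runif(n, 5, 30),
                  avgpoverty = rnorm(n, 4, 0.5),
                  t1994r     = runif(n, 40, 90),
                  pobtot1994 = rpois(n, 800))
sim$pri2000v <- 30 + 3 * sim$treatment + 0.2 * sim$pri1994v + rnorm(n, 0, 8)

# Full specification with all listed controls
m_full <- lm(pri2000v ~ treatment + pri1994v + pan1994v + prd1994v +
               avgpoverty + t1994r + pobtot1994, data = sim)
summary(m_full)
```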